Nonlinear principal component analysis
Authors
Abstract
We study the extraction of nonlinear data models in high-dimensional spaces with modified self-organizing maps. We present a general algorithm which maps low-dimensional lattices into high-dimensional data manifolds without violation of topology. The approach is based on a new principle exploiting the specific dynamical properties of the first-order phase transition induced by the noise of the data. Moreover, we present a second algorithm for the extraction of generalized principal curves comprising disconnected and branching manifolds. The performance of the algorithm is demonstrated for both one- and two-dimensional principal manifolds and also for the case of sparse data sets. As an application, we reveal cluster structures in a set of real-world data from the domain of ecotoxicology.
Author contact: P.O. Box, Leipzig, Germany; e-mail: der@informatik.uni-leipzig.de or ulrich@mis.mpg.de
Introduction
One of the major objectives of data analysis is the extraction and instructive representation of the relevant information contained in the data. In cases of practical interest, data are given by high-dimensional vectors corrupted by noise. Dimension reduction and elimination of noise are then the essential steps in analyzing the data. Principal component analysis (PCA) is one of the most prominent tools in this process. By uncovering the principal components of the data distribution, PCA creates a lower-dimensional subspace which contains the relevant information on the data. Although highly successful in typical cases, PCA suffers from the drawback of being a linear method. By way of example, consider the globe with the locations of cities as data points. PCA would discover three principal components in this data set, so that there is no complexity reduction in the description of the data. On the other hand, a topographic map of the globe provides a two-dimensional representation which can be analysed successfully using conventional methods like PCA. The location of cities on the globe forms a nonlinear data
manifold. The above example suggests a two-step treatment: first mapping the nonlinear data set onto a linear lower-dimensional manifold and then applying conventional methods. However, real-world data manifolds, besides being nonlinear, are often corrupted by noise and embedded in high-dimensional spaces. The present paper presents general procedures for finding the optimal mappings in this general case.
Looking in the opposite direction, the map can also be seen as embedding a low-dimensional manifold $M$ (a regular lattice, say) into the higher-dimensional data manifold. $M$ is called a principal manifold (PM) of the data if it provides an optimized (in some sense) representation of the data. A convenient choice is defined self-consistently by the requirement that each point on the PM is the average of the data points projecting to it. Thus it minimizes the mean square deviation of the data from the PM subject to some smoothness constraint. The present contribution provides general procedures for finding such principal manifolds for arbitrary data sets.
Principal curves
Let us now consider the problem of finding principal manifolds in some detail. We will restrict ourselves to the case of principal curves, since this already shows the full complexity of the problem.
Definition of principal curves
Let us consider a data set $X$ with data $v = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$. A principal component $P$ describes a data set $X$ as a linear function $f$ of a single parameter $\lambda$, i.e. $v \in X$ is represented by $f(\lambda(v)) = \lambda(v)\,c_1 + c_0 \in P$. Given $X$, the vectors $c_0$ and $c_1$ are determined by minimizing the reconstruction error, $\partial E/\partial c_0 = 0$, $\partial E/\partial c_1 = 0$, where $E = \int_X \|v - c_0 - \lambda c_1\|^2 P(v)\,dv$, $P(v)$ being the probability distribution of the data. These equations imply that the distance $\|v - w\|$ is minimal with respect to variations of $w$ along $P$. In other words, the projection of a data point $v$ is given by its closest point on the principal component. A principal curve $P_X$ is a generalized principal component in that a large class of nonlinear smooth vector-valued functions $f(\lambda) \in \mathbb{R}^n$ is allowed for the representation of the
data. The projection $\lambda(v)$ of a data point $v$ onto the curve is again defined by the value of $\lambda$ for which $f(\lambda)$ is closest to $v$: $\lambda(v) = \arg\min_{\lambda} \|v - f(\lambda)\|$. In general, each point (each segment, in the discrete case) of the curve is the projection of a subset of data points, called its projectors. In the linear case the projections agree with the center of gravity of the projectors. In this sense one may say that the principal component runs through the middle of its data points. This definition can be carried over to the nonlinear case. Hence a principal curve is defined as running through the centers of gravity of the points projecting to it (see Fig.). One may also say that each point on the PC is required to be the average of its projectors. This can be...
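The projection rule and the center-of-gravity property described above can be illustrated with a small sketch. It assumes a curve discretized into a finite set of nodes and uses NumPy; the function names `project_to_curve` and `projector_means`, the half-circle test data, and the noise level are illustrative assumptions, not taken from the paper. It is not the self-organizing-map algorithm announced in the abstract, only a check of the definition: each data point is assigned to its closest node, and the mean of the projectors of each node is compared with the node itself.

```python
import numpy as np

def project_to_curve(data, nodes):
    """Discrete projection: index of the closest curve node for each data point."""
    # Pairwise squared distances, shape (n_points, n_nodes).
    d2 = ((data[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def projector_means(data, nodes, idx):
    """Center of gravity of the projectors of each node.

    A node without projectors keeps its current position.
    """
    means = nodes.copy()
    for k in range(len(nodes)):
        members = data[idx == k]
        if len(members) > 0:
            means[k] = members.mean(axis=0)
    return means

rng = np.random.default_rng(0)

# Hypothetical test data: noisy samples around a half circle of radius 1.
t = rng.uniform(0.0, np.pi, size=2000)
data = np.column_stack([np.cos(t), np.sin(t)]) + 0.05 * rng.normal(size=(2000, 2))

# Discretized candidate curve: 20 nodes on the noise-free half circle.
lam = np.linspace(0.0, np.pi, 20)
nodes = np.column_stack([np.cos(lam), np.sin(lam)])

idx = project_to_curve(data, nodes)
means = projector_means(data, nodes, idx)

# For a self-consistent curve the nodes coincide with the projector means,
# so this shift should be small compared with the curve's extent.
max_shift = np.abs(means - nodes).max()
print("largest node-to-mean shift:", max_shift)
```

Replacing `nodes` by `means` and repeating the two steps gives a naive fixed-point iteration toward self-consistency; without the smoothness constraint mentioned in the text, such an iteration can follow the noise when data are sparse.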
Similar resources
Neuralnets for Multivariate And Time Series Analysis (NeuMATSA): A User Manual (Version 5.0). MATLAB codes for: nonlinear principal component analysis, nonlinear canonical correlation analysis, nonlinear singular spectrum analysis
Neuralnets for Multivariate And Time Series Analysis (NeuMATSA): A User Manual (Version 2.3.1). MATLAB codes for: nonlinear principal component analysis, nonlinear canonical correlation analysis, nonlinear singular spectrum analysis
Neuralnets for Multivariate And Time Series Analysis (NeuMATSA): A User Manual (Version 5.0.1). MATLAB codes for: nonlinear principal component analysis, nonlinear canonical correlation analysis, nonlinear singular spectrum analysis
Principal component analysis or factor analysis: different wording or methodological fault?
This article has no abstract.
Probabilistic Analysis of Kernel Principal Components
This paper presents a probabilistic analysis of kernel principal components by unifying the theory of probabilistic principal component analysis and kernel principal component analysis. It is shown that, while the kernel component enhances the nonlinear modeling power, the probabilistic structure offers (i) a mixture model for nonlinear data structure containing nonlinear sub-structures, and (i...
Nonlinear Principal Component Analysis
Two quite different forms of nonlinear principal component analysis have been proposed in the literature. The first one is associated with the names of Guttman, Burt, Hayashi, Benzécri, McDonald, De Leeuw, Hill, Nishisato. We call it multiple correspondence analysis. The second form has been discussed by Kruskal, Shepard, Roskam, Takane, Young, De Leeuw, Winsberg, Ramsay. We call it no...